5  Introduction to R and RStudio

5.1 What is R?

  • Open source (free!) statistical programming language/software

  • It can be used for:

    • Working with data - cleaning, wrangling and transforming
    • Conducting analyses including advanced statistical methods
    • Creating high-quality tables & figures
    • Communicating research with R Markdown
  • It is constantly growing!

  • Has a strong online support community

  • Since it’s one programming language, it is versatile enough to take you from raw data to publishable research using free, reproducible code!

5.2 What is RStudio?

  • RStudio is a free, open source IDE (integrated development environment) for R. (You must install R before you can install RStudio.)

  • Its interface is organized so that the user can clearly view graphs, tables, R code, and output all at the same time.

  • It also offers an Import-Wizard-like feature that allows users to import CSV, Excel, SPSS (*.sav), and Stata (*.dta) files into R without having to write the code to do so.

5.3 R versus Other Software

Excel and SPSS are convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, they are a poor choice for statistical analysis beyond the simplest descriptive statistics, or for more than a very few columns.

Proportion of articles in health decision sciences using the identified software

5.4 Why should you learn R?

  • R is becoming the “lingua franca” of data science
  • It is widely used and rising in popularity
  • R is also the tool of choice for data scientists at Microsoft, Google, Facebook, Amazon
  • R’s popularity in academia is important because that creates a pool of talent that feeds industry.
  • Learning the “skills of data science” is easiest in R

Increasing use of R in scientific research

Some of the reasons for choosing R over tools such as Excel and SPSS are the limitations of those tools:

  • Missing values are handled inconsistently, and sometimes incorrectly.
  • Data organisation is difficult.
  • Analyses can only be done on one column at a time.
  • Output is poorly organised.
  • No record of how an analysis was accomplished.
  • Some advanced analyses are impossible

5.5 Health Data Science

Health Data Science is an emerging discipline, combining mathematics, statistics, epidemiology and informatics.

R is widely used in the field of health data science, especially in healthcare domains such as genetics, drug discovery, bioinformatics, vaccine research, deep learning, epidemiology, and public health.


Applications of Data Science in Healthcare

As data-generating technologies have proliferated throughout society and industry, leading hospitals are trying to ensure this data is harnessed to achieve the best outcomes for patients. These internet of things (IoT) technologies include everything from sensors that monitor patient health and the condition of machines to wearables and patients’ mobile phones. All these comprise the “Big Data” in healthcare.

5.6 Reproducible Research

Research is considered to be reproducible when the exact results can be reproduced if given access to the original data, software, or code.

  • The same results should be obtained under the same conditions
  • It should be possible to recreate the same conditions

Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results. Reproducibility is a minimum necessary condition for a finding to be believable and informative. — U.S. National Science Foundation (NSF) subcommittee on Replicability in Science

There are four key elements of reproducible research:

  • data documentation
  • data publication
  • code publication
  • output publication

Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016)

Flavours of Reproducible Research

Factors behind irreproducible research

  • Not enough documentation on how the experiment was conducted and how the data were generated
  • Data used to generate original results unavailable
  • Software used to generate original results unavailable
  • Difficult to recreate software environment (libraries, versions) used to generate original results
  • Difficult to rerun the computational steps

Threats to Reproducibility (Munafò et al., 2017)

While reproducibility is the minimum requirement and can be solved with “good enough” computational practices, replicability/ robustness/ generalisability of scientific findings are an even greater concern involving research misconduct, questionable research practices (p-hacking, HARKing, cherry-picking), sloppy methods, and other conscious and unconscious biases.

What are the good practices of reproducible research?

How to make your work reproducible?

Reproducible workflows give you credibility!

Cartoon created by Sidney Harris (The New Yorker)

Reproducibility spectrum for published research. Source: Peng, RD Reproducible Research in Computational Science Science (2011)

5.7 Getting Comfortable with R and RStudio

5.7.1 Install R

  1. Go here: https://cran.rstudio.com/

  2. Choose the correct “Download R for. . .” option from the top (probably Windows or macOS), then…

    1. For Windows users, choose “Install R for the first time” (next to the base subdirectory) and then “Download R 4.4.2 for Windows”

    2. For macOS users, select the appropriate version for your operating system (e.g. if the latest release is version 4.4.2, the file will look something like R-4.4.2-arm64.pkg), then choose to Save or Open

  3. Once downloaded, open the installer, agree to the license, and install like you would any other software.

If it installs, you should be able to find the R icon in your applications.

5.7.2 Install RStudio

RStudio is a user-friendly interface for working with R. That means you must have R already installed for RStudio to work. Make sure you’ve successfully installed R in Step 1, then. . .

  1. Go to https://www.rstudio.com/products/rstudio/download/ to download RStudio Desktop (Open Source License). You’ll know you’re clicking the right one because it says “FREE” right above the download button.

  2. Click download, which takes you just down the page to where you can select the correct version under Installers for Supported Platforms (almost everyone will choose one of the first two options, RStudio for Windows or macOS).

  3. Click on the correct installer version, save, open once downloaded, agree to license and install like you would any other software. The version should be at least RStudio 2024.09 “Cranberry Hibiscus”, 2024.

If it installs, you should be able to find the RStudio icon in your applications.

5.8 Understanding the RStudio environment

5.8.1 Pane layout

The RStudio environment consists of multiple windows. Each window consists of certain panels.

Panels in RStudio

  1. Source
  2. Console
  3. Environment
  4. History
  5. Files
  6. Plots
  7. Connections
  8. Packages
  9. Help
  10. Build
  11. Tutorial
  12. Viewer

Not every panel will be used routinely, either by you or by us during the workshop. The workshop focuses on using R for healthcare professionals as a database management, visualization, and communication tool. The panels that most often require attention are the source, console, environment, history, files, packages, help, tutorial, and viewer panels.

5.8.2 A guided tour

You are requested to make your own notes during the workshop. Let us dive deep into understanding the environment further in the workshop.

5.8.3 File types in R

The most commonly used file types are:

  1. .R : Script file
  2. .Rmd : R Markdown file
  3. .qmd : Quarto file
  4. .rds : A single R object saved to a file
  5. .RData : Multiple R objects saved in a single file

5.8.4 Programming basics

R is easiest to use when you know how the R language works. This section will teach you the implicit background knowledge that informs every piece of R code. You’ll learn about:

  1. Functions and their arguments
  2. Objects
  3. R’s basic data types
  4. R’s basic data structures including vectors and lists
  5. R’s package system

5.8.5 Functions and their arguments

To do anything in R, we call functions to do the work for us. For example, to compute the square root of 5197, we call the sqrt() function.

sqrt(5197)
[1] 72.09022

Important things to know about functions include:

  1. Code body.

Typing a function's name without parentheses and running it prints the function's code body, which shows what the function does in the background.

sqrt
function (x)  .Primitive("sqrt")
  2. Run a function.

To run a function, add parentheses () after its name. Inside the parentheses, supply the inputs, such as the number in the example above.

  3. Help page.

Placing a question mark before the function name takes you to its help page. Note that no parentheses are used when calling the help page. The help page will help you learn about new functions on your journey!

?sqrt 

Tip:

Annotations (comments, which start with #) are meant for humans to read, not machines. They let us take notes as we write code. As a result, when you next open your code, even after a long time, you will know what you did last summer :)


Arguments are inputs provided to the function. There are functions which take no arguments, some take a single argument and some take multiple arguments. When there are two or more arguments, the arguments are separated by a comma.

# No argument
Sys.Date()
[1] "2024-11-06"
# One argument
sqrt(5197)
[1] 72.09022
# Two arguments
sum(2,3)
[1] 5
# Multiple arguments
seq(from=1,
    to = 10, 
    by  = 2)
[1] 1 3 5 7 9

Matching arguments: When arguments are not named, R matches the inputs to the arguments by position. For example, generating a sequence involves three arguments: from, to, and by. Inputs given in that order are automatically matched to the right arguments.

seq(1,10,2)
[1] 1 3 5 7 9

Caution: Inputs given in the wrong order are matched by position too, without any warning. Best practice is to be explicit at the early stages. Use argument names!

seq(2,10,1)
[1]  2  3  4  5  6  7  8  9 10
seq(by = 2,
    to = 10,
    from = 1)
[1] 1 3 5 7 9

Optional arguments: Some arguments are optional because they have default values, which R uses unless you specify otherwise. For example, in the sum() function, na.rm is an optional argument with the default na.rm = FALSE. This means that NA values are not removed by default, so the result is NA whenever the input contains an NA. You can override a default by setting the argument explicitly.

sum(2,3,NA)
[1] NA
sum(2,3,NA, na.rm = TRUE)
[1] 5

In contrast, arguments that have no default value are mandatory and must be supplied explicitly! Without them, the function returns an error.

sqrt()
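To look at such an error without halting a script, the call can be wrapped in tryCatch(), a base R function. This is only a minimal sketch, and the object name error_message is made up for illustration.

```r
# sqrt() has one mandatory argument, so calling it with none raises an error.
# tryCatch() intercepts the error and returns its message as text instead.
error_message <- tryCatch(
  sqrt(),
  error = function(e) conditionMessage(e)
)

# Print the captured message
error_message
```

In everyday interactive work you would simply read the error in the console; capturing it like this is mainly useful inside longer scripts.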

5.8.6 Objects

If we want to use results later, rather than only view them in the console, we need to store them as objects. To create an object, type the name of the object (choose wisely: let it be explicit and self-explanatory!), followed by the assignment operator <-. Everything to the right of the operator is assigned to the object. You can save a single value, the output of a function, multiple values, or an entire data set in a single object.

# Single value
x <- 3
x
[1] 3
# Output from function
x <- seq(from=1,
    to = 10, 
    by  = 2)
# Better name:
sequence_from_1_to_10 <- seq(from=1,
    to = 10, 
    by  = 2)

Creating an object lets us view its contents and makes it easier to apply additional functions to it.

Tip: While you type function or object names, RStudio offers autocomplete suggestions. Choose from the suggestions rather than typing the entire name. It will save you from many typing errors later!

sequence_from_1_to_10
[1] 1 3 5 7 9
sum(sequence_from_1_to_10)
[1] 25

5.8.7 Vectors

R stores values as vectors, which are one-dimensional arrays. Arrays in general can be two-dimensional (similar to Excel or other tabular data) or multidimensional, but vectors are always one-dimensional!

Vectors can be a single value or a combination of values. We can create our own vectors using c() function.

single_number <- 3
single_number
[1] 3
number_vector <- c(1,2,3)
number_vector
[1] 1 2 3

Creating your own vectors is powerful because many functions in R take vectors as inputs.

mean(number_vector)
[1] 2

Vectorized functions: The function is applied to each element of the vector:

sqrt(number_vector)
[1] 1.000000 1.414214 1.732051

If we have two vectors of the same length (such as two columns of a research dataset), vectorised functions let us compute a new column: the function is applied to each pair of corresponding elements, and the result is a vector of the same length (think of it as a new column in the research data).

number_vector2 <- c(3,-4,5.4)
number_vector + number_vector2
[1]  4.0 -2.0  8.4
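As a further illustration of element-wise arithmetic on two columns, the sketch below computes body mass index from weight and height. The values and the object names weight_kg, height_m and bmi are invented for this example and do not come from the workshop data.

```r
# Hypothetical measurements for three participants
weight_kg <- c(60, 72, 85)
height_m  <- c(1.60, 1.80, 1.70)

# Arithmetic is applied element by element, so the result also has length 3
bmi <- weight_kg / height_m^2

# Round to one decimal place for display
round(bmi, 1)
```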

5.8.8 Data Types

R recognizes different types of vectors based on the values in the vector.

If all values are numbers (positive numbers, negative numbers, decimals), R will consider that vector as numerical and allow you to carry out mathematical operations and functions. You can find the class of a vector using the class() function. R labels these vectors as “double”, “numeric”, or “integer”.

class(number_vector)
[1] "numeric"
class(number_vector2)
[1] "numeric"

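If the values are within quotation marks, R treats the vector as a character vector by default. This is comparable to a nominal variable. Whole numbers written with the suffix L (as in the second example below) are stored as integers.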

alphabets_vector <- c("a", "b", "c")
class(alphabets_vector)
[1] "character"
integer_vector <- c(1L,2L)
class(integer_vector)
[1] "integer"

Logical vectors contain TRUE and FALSE values

logical_vector <- c(TRUE, FALSE)
class(logical_vector)
[1] "logical"

Factor vectors are categorical variables. Other variable types can be converted to the factor type using the factor() function.

factor_vector <- factor(number_vector)
factor_vector
[1] 1 2 3
Levels: 1 2 3

We can add labels to factor vectors using optional arguments

factor_vector <- factor(number_vector,
                        levels =c(1,2,3),
                        labels = c("level1", 
                                   "level2", 
                                   "level3"))
factor_vector
[1] level1 level2 level3
Levels: level1 level2 level3

One vector = One type. For example, when there is a mix of numbers and characters, R will convert all the values to character.

mix_vector <- c(1,"a")
class(mix_vector)
[1] "character"

Note that the number 1 has been converted into character class.

mix_vector[1]
[1] "1"
mix_vector[1] |> class()
[1] "character"
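One practical consequence of the one-type rule is that a numeric column containing a single stray text entry is stored entirely as character. The sketch below, using made-up values and illustrative object names, shows one way to convert such a vector back to numbers with the base R function as.numeric().

```r
# A column typed in as text, with one entry that is not a number (made-up values)
age_text <- c("34", "51", "unknown")
class(age_text)

# as.numeric() converts what it can; "unknown" becomes NA.
# R normally warns about this; suppressWarnings() hides the warning here.
age_number <- suppressWarnings(as.numeric(age_text))
age_number
class(age_number)

# Count the values that could not be converted
sum(is.na(age_number))
```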

Double, character, integer, logical, complex, raw, dates, etc. There are many other data types and objects, but for now, let’s start with these. You will come to understand additional types as you proceed in your R journey!

5.8.9 Lists

In addition to vectors, lists are another powerful type of object. A list can be thought of as a vector of vectors! Lists enable you to store multiple types of vectors together. A list can be made using the list() function. It is similar to the c() function but creates a list rather than a vector. It is good practice to name the vectors in the list.

example_list <- list(numbers = number_vector, 
                     alphabets = alphabets_vector)
class(example_list)
[1] "list"
example_list
$numbers
[1] 1 2 3

$alphabets
[1] "a" "b" "c"

The elements of a named list can be accessed using $. (For a named atomic vector, $ does not work; use square brackets with the name instead.)

example_list$numbers
[1] 1 2 3
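Besides $, base R offers double and single square brackets for pulling elements out of a list. The following is a small sketch with invented objects (patient, heights) to show how the three forms differ.

```r
# A hypothetical list holding two vectors of different types
patient <- list(ages = c(34, 51, 67),
                sexes = c("F", "M", "F"))

# $ and [[ ]] both return the element itself (here, a numeric vector)
patient$ages
patient[["ages"]]

# Single brackets return a smaller list that still wraps the element
class(patient["ages"])

# Named atomic vectors use brackets with the name; $ does not work on them
heights <- c(anna = 160, ben = 180)
heights["ben"]
```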

5.8.10 Packages

There are thousands of functions in R. To be computationally efficient, R does not load all functions on start. It loads only the base functions. When you want to use additional functions, you need to load the packages that contain them using the library() function.

Additional packages are installed once, but loaded every time you start an R session.
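The install-once, load-every-session pattern can be sketched as follows. The install line is commented out so that the example runs without an internet connection, and the stats package is used for loading only because it ships with R; in practice you would name the package you need.

```r
# Install once per computer (commented out so this example does not download anything)
# install.packages("dplyr")

# Load at the start of every new session; stats ships with R, so this always works
library(stats)

# A function can also be called without loading its package, using package::function
middle_value <- stats::median(c(1, 3, 5))
middle_value
```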

With these basics, let’s dive deep into the workshop! Are you ready?

5.9 Exploring Data with R

To recap what we learnt in the previous sessions: we now know how to work within the R Project environment. here::here() makes it easy for us to manage file paths. You can quickly have a look at your data using the View() and glimpse() functions. Most tidy data is read in as a tibble, which is the workhorse of the tidyverse.

here::here() is better than setwd()

here::here() allows us to build file paths very easily

5.10 Getting Started with the Data Exploration Pipeline

5.10.1 Set-up

#install.packages("pacman")


pacman::p_load(tidyverse, here)

#tidyverse required for tidy workflows (including reading data with read_csv() and read_rds())
#here required for managing file paths

Note

The shortcut for code commenting is Ctrl+Shift+C.

5.10.2 Load Data

The dataset we will be working with has been cleaned (to an extent) for the purposes of this workshop. It is taken from NHANES and has been cleaned up and modified for our use.

# Check the file path
here::here("data", "nhanes_basic_info.csv")
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_basic_info.csv"
# Read Data
df <- read_csv(here("data", "nhanes_basic_info.csv"))

Try the following functions using df as the argument:

  • glimpse()
  • head()
  • names()

Now, we will be introducing you to three new packages:

  1. dplyr
  2. skimr
  3. DataExplorer

5.11 dplyr Package

dplyr is a powerful R package for manipulating, cleaning and summarizing data. In short, it makes data exploration and data manipulation easy and fast in R.

There are many verbs in dplyr that are useful, some of them are given here…

Important functions of the dplyr package to remember

Syntax structure of the dplyr verb

5.11.1 Getting used to the pipe |> or %>%

The pipe operator in dplyr

Note

The pipe |> means THEN…

The pipe is an operator in R that allows you to chain together functions in dplyr.

Let’s find the bottom 50 rows of df without and with the pipe.

Tip: The native pipe |> is preferred.

#without the pipe
tail(df, n = 50)

#with the pipe
df |> tail(n = 50)

Now let’s see what the code looks like if we need 2 functions. Find the unique ages in the bottom 50 rows of df.

#without the pipe
unique(tail(df, n = 50)$age)

# with the pipe
df |> 
  tail(50) |>
  distinct(age)

Note

The shortcut for the pipe is Ctrl+Shift+M

You will notice that we used different functions to complete our task. The code without the pipe uses functions from base R while the code with the pipe uses a mixture (tail() from base R and distinct() from dplyr). Not all functions work with the pipe, but we will usually opt for those that do when we have a choice.

5.11.2 distinct() and count()

The distinct() function returns the distinct values of a column, while count() provides both the distinct values of a column and the number of times each value shows up. The following example investigates the different values of race in the df dataset:

df |> 
  distinct(race) 

df |> 
  count(race)

Notice that there is a new column produced by the count function called n.

5.11.3 arrange()

The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) it by the column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange in descending order.

The following code would get the number of times each race is in the dataset:

df |> 
  count(race) |> 
  arrange(n)

# Since the default is ascending order, 
# we are not getting the results that are probably useful, 
# so let's use the desc() function
df |> 
  count(race) |> 
  arrange(desc(n))

# shortcut for desc() is -
df |> 
  count(race) |> 
  arrange(-n)

5.11.4 filter()

If you want to return rows of the data where some criteria are met, use the filter() function. This is how we subset in the tidyverse. (Base R function is subset())

Here are the logical criteria in R:

  • ==: Equal to
  • !=: Not equal to
  • >: Greater than
  • >=: Greater than or equal to
  • <: Less than
  • <=: Less than or equal to

If you want to satisfy all of multiple conditions, you can use the “and” operator, &.

The “or” operator | (the vertical pipe character, shift-backslash) will return a subset that meets any of the conditions.

Let’s see all the data for age 60 or above

df |> 
  filter(age >= 60)

Let’s just see the data where race is White

df |> 
  filter(race == "White")

Both White and age 60 or more

df_60_plus_white <- df |> 
  filter(age >= 60 & race == "White")
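The examples above use a single condition or the & operator. To contrast & with the “or” operator |, here is a small sketch on a made-up tibble called toy (not the workshop dataset); its column names simply mirror age and race, and it assumes the dplyr package is installed.

```r
library(dplyr)

# A small made-up data set; column names mirror the workshop data
toy <- tibble(
  age  = c(25, 62, 70, 41),
  race = c("White", "Black", "White", "Other")
)

# "or": keep rows that meet at least one condition
either <- toy |>
  filter(age >= 60 | race == "White")

# "and": keep rows that meet both conditions
both <- toy |>
  filter(age >= 60 & race == "White")

# Compare how many rows each filter keeps
nrow(either)
nrow(both)
```

The “or” filter keeps more rows than the “and” filter because a row only has to satisfy one of the two conditions.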

5.11.5 %in%

To filter() a categorical variable for only certain levels, we can use the %in% operator.

Let’s check which race groups are in the dataset.

df |> 
  select(race) |> 
  unique()
# A tibble: 5 × 1
  race    
  <chr>   
1 White   
2 Mexican 
3 Hispanic
4 Other   
5 Black   

Now we’ll create a vector of races we are interested in

others <- c("Mexican", 
              "Hispanic", 
              "Other")

And use that vector to filter() df for races %in% others

df |> 
  filter(race %in% others)

You can also save the results of a pipeline. Notice that the rows belonging to the selected race groups are printed in the console. If we wanted to do something with those rows, it would be helpful to save them as their own dataset. To create a new object, we use the <- operator.

others_df <- df |> 
  filter(race %in% others)

5.11.6 drop_na()

The drop_na() function is extremely useful for when we need to subset a variable to remove missing values.

Return the NHANES dataset without rows that were missing on the education variable

df |> 
  drop_na(education)

Return the dataset without any rows that have an NA in any column. Use with caution, because this can remove a lot of data.

df |> 
  drop_na()

5.11.7 select()

Whereas the filter() function allows you to return only certain rows matching a condition, the select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.

See just the unique_id, age, and education columns

# list the column names you want to see separated by a comma

df |>
  select(unique_id, age, education)

Use the - sign to drop columns instead

df |>
  select(-income_hh, -poverty, -home_rooms)

5.11.8 select() helper functions

The starts_with(), ends_with() and contains() functions provide very useful tools for dropping or keeping several variables at once without having to list each and every column. They return columns that start with, end with, or contain a specific string of text.

# these functions are all case sensitive
df |>
  select(starts_with("home"))

df |>
  select(ends_with("t"))

df |>
  select(contains("_"))

# columns that do not contain _
df |>
  select(-contains("_"))

5.11.9 summarize()

The summarize() function summarizes multiple values to a single value. On its own the summarize() function doesn’t seem to be all that useful. The dplyr package provides a few convenience functions called n() and n_distinct() that tell you the number of observations or the number of distinct values of a particular variable.

Note summarize() is the same as summarise()

Notice that summarize takes a data frame and returns a data frame. In this case it’s a 1x1 data frame with a single row and a single column.

df |>
  summarize(mean(age))

# watch out for NAs. Use na.rm = TRUE to run the calculation after excluding NAs.

df |>
  summarize(mean(weight, na.rm = TRUE))

The name of the column is the expression used to summarize the data. This usually isn’t pretty, and if we wanted to work with this resulting data frame later on, we’d want to name that returned value something better.

df |>
  summarize(mean_age = mean(age, na.rm = TRUE))
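The convenience functions n() and n_distinct() mentioned at the start of this section can be combined with other summaries in a single summarize() call. The sketch below uses a made-up tibble named toy rather than the workshop data, and the output column names are illustrative; it assumes dplyr is installed.

```r
library(dplyr)

# A small made-up data set with one missing age
toy <- tibble(
  race = c("White", "Black", "White", "Other", "Black"),
  age  = c(25, 62, 70, 41, NA)
)

toy_summary <- toy |>
  summarize(
    n_rows   = n(),                     # number of rows
    n_races  = n_distinct(race),        # number of different race values
    mean_age = mean(age, na.rm = TRUE)  # mean after dropping the NA
  )

toy_summary
```

The result is still a data frame with one row, now with three named columns.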

5.11.10 group_by()

We saw that summarize() isn’t that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame where operations are performed by group.

df |>
  group_by(gender) 

df |>
  group_by(gender, race)

5.11.11 group_by() and summarize() together

The real power comes in where group_by() and summarize() are used together. First, write the group_by() statement. Then pipe the result to a call to summarize().

Let’s summarize the mean height for each race

df |>
  group_by(race) |>
  summarize(mean_height = mean(height, na.rm = TRUE))

#sort the output by descending mean_height
df |>
  group_by(race) |>
  summarize(mean_height = mean(height, na.rm = TRUE))|>
  arrange(desc(mean_height))

5.11.12 mutate()

Mutate creates a new variable or modifies an existing one.

Let’s create a column called elderly that flags whether the age is greater than or equal to 65.

df |>
  mutate(elderly = if_else(
    age >= 65,
    "Yes", 
    "No"))

The same thing can be done using case_when().

df |>
  mutate(elderly = case_when(
    age >= 65 ~ "Yes",
    age < 65 ~ "No",
    TRUE ~ NA))

Let’s do it again, but this time let us code it as 1 and 0: 1 if age is greater than or equal to 65, and 0 otherwise.

df |>
  mutate(old = case_when(
    age >= 65 ~ 1,
    age < 65 ~ 0,
    TRUE ~ NA))

Note

The if_else() function may result in slightly shorter code if you only need to code for 2 options. For more options, nested if_else() statements become hard to read and could result in mismatched parentheses so case_when() will be a more elegant solution.

As a second example of case_when(), let’s say we wanted to create a new income variable that is low, medium, or high.

See how income_hh breaks into 3 roughly equal portions

quantile(df$income_hh, probs = c(.33, .66), na.rm = TRUE)

Note

See the help file for the quantile function, or type ?quantile in the console.

We’ll say:

  • low = 30000 or less
  • medium = between 30000 and 70000
  • high = above 70000

# rows with a missing income_hh match no condition and are left as NA
df |>
  mutate(income_cat = case_when(
    income_hh <= 30000 ~ "low",
    income_hh > 30000 & income_hh <= 70000 ~ "medium",
    income_hh > 70000 ~ "high"))

5.11.13 join()

Typically in a data science or data analysis project, one has to work with many sources of data. The researcher must be able to combine multiple datasets to answer the questions he or she is interested in. Collectively, these multiple tables of data are called relational data, because it is the relations between the datasets, more than the individual datasets, that matter.

As with the other dplyr verbs, there are different families of verbs that are designed to work with relational data and one of the most commonly used family of verbs are the mutating joins.

Different type of joins, represented by a series of Venn Diagram

These include:

  • left_join(x, y) which combines all columns in data frame x with those in data frame y but only retains rows from x.

  • right_join(x, y) also keeps all columns but operates in the opposite direction, returning only rows from y.

  • full_join(x, y) combines all columns of x with all columns of y and retains all rows from both data frames.

  • inner_join(x, y) combines all columns present in either x or y but only retains rows that are present in both data frames.

  • anti_join(x, y) returns the columns from x only and retains rows of x that are not present in y.

  • anti_join(y, x) returns the columns from y only and retains rows of y that are not present in x.

Visual representation of the join() family of verbs

Apart from specifying the data frames to be joined, we also need to specify the key column(s) that is to be used for joining the data. Key columns are specified with the by argument, e.g. inner_join(x, y, by = "subject_id") adds columns of y to x for all rows where the values of the “subject_id” column (present in each data frame) match. If the name of the key column is different in both the dataframes, e.g. “subject_id” in x and “subj_id” in y, then you have to specify both names using by = c("subject_id" = "subj_id").
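To make the behaviour of the different joins and the by argument concrete, here is a small sketch on two invented tables, x and y, whose key columns have different names. The column names and values are assumptions made for this illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Two made-up tables whose key columns have different names
x <- tibble(subject_id = c(1, 2, 3), age = c(34, 51, 67))
y <- tibble(subj_id = c(2, 3, 4), bp_sys = c(120, 135, 110))

# Keep every row of x; subject 1 has no match in y, so its bp_sys is NA
left <- left_join(x, y, by = c("subject_id" = "subj_id"))

# Keep only subjects present in both tables
inner <- inner_join(x, y, by = c("subject_id" = "subj_id"))

# Keep every subject from either table
full <- full_join(x, y, by = c("subject_id" = "subj_id"))

# Rows of x with no match in y
anti <- anti_join(x, y, by = c("subject_id" = "subj_id"))

# Compare the number of rows each join keeps
nrow(left)
nrow(inner)
nrow(full)
nrow(anti)
```

Comparing row counts before and after a join like this is a quick sanity check that the key columns matched as intended.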

Example

Let’s try to join the basic information dataset (nhanes_basic_info.csv) with the clinical dataset (nhanes_clinical_info.rds).

basic <- read_csv(
  here("data", 
       "nhanes_basic_info.csv"))
Rows: 5679 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): gender, race, education, marital_status, home_own, work, bmi_who
dbl (7): unique_id, age, income_hh, poverty, home_rooms, height, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
clinical <- read_rds(
  here("data", 
       "nhanes_clinical_info.rds"))

df <- basic |> 
  left_join(clinical)
Joining with `by = join_by(unique_id)`

Try to join behaviour dataset (nhanes_behaviour_info.rds).

5.11.14 pivot()

Most often, when working with our data we may have to reshape our data from long format to wide format and back. We can use the pivot family of functions to achieve this task. What we mean by “the shape of our data” is how the values are distributed across rows or columns. Here’s a visual representation of the same data in two different shapes:

Long and Wide format of our data
  • “Long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.

  • “Wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our “metric” spread across multiple columns (a column each for a year).

Let us now use the pivot functions to reshape the data in practice. The two pivot functions are:

  • pivot_wider(): from long to wide format.
  • pivot_longer(): from wide to long format.

Let’s try pivot_longer(). Suppose we need a long data format for the bp_sys and bp_sys_post variables:

df_long <- df |> 
  pivot_longer(
    cols = c(bp_sys, bp_sys_post),
    names_to = "bp_sys_cat",
    values_to = "bp_value")

Let’s try pivot_wider(). Suppose we need a wide data format for the height variable based on the race variable.

df_wider <- df |> 
  pivot_wider(names_from = "race",
              values_from = "height",
              names_prefix = "height_")
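Because the workshop dataset is large, the effect of pivoting can be easier to see on a tiny table. The sketch below builds a made-up two-row data frame, reshapes it to long format and back again, and assumes the tidyr package is installed. The object names wide, long and wide_again are illustrative only.

```r
library(tidyr)

# Made-up wide data: one row per person, one column per measurement
wide <- data.frame(
  unique_id   = c(1, 2),
  bp_sys      = c(120, 135),
  bp_sys_post = c(118, 130)
)

# Wide to long: one row per person per measurement
long <- wide |>
  pivot_longer(
    cols = c(bp_sys, bp_sys_post),
    names_to = "bp_sys_cat",
    values_to = "bp_value")

# Long back to wide: the original shape returns
wide_again <- long |>
  pivot_wider(
    names_from = bp_sys_cat,
    values_from = bp_value)

dim(long)        # rows and columns of the long table
dim(wide_again)  # rows and columns of the restored wide table
```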

Resources for learning more dplyr

  • Check out the Data Wrangling cheatsheet that covers dplyr and tidyr functions. (https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

  • Review the Tibbles chapter of the excellent, free R for Data Science book. (https://r4ds.had.co.nz/tibbles.html)

  • Check out the Transformations chapter to learn more about the dplyr package. Note that this chapter also uses the graphing package ggplot2, which we covered yesterday. (https://r4ds.had.co.nz/transform.html)

  • Check out the Relational Data chapter to learn more about joins. (https://r4ds.had.co.nz/relational-data.html)

5.12 skimr Package

skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors. The core function of skimr is the skim() function, which is designed to work with (grouped) data frames, and will try to coerce other objects to data frames if possible.

Give skim() a try.

df |> 
  skimr::skim()

Check out the names of the output of skimr

df |> 
  skimr::skim() |> 
  names()

Also works with dplyr verbs

df |> 
  group_by(race) |> 
  skimr::skim()
df |> 
  skimr::skim() |>
  dplyr::select(skim_type, skim_variable, n_missing)

5.13 DataExplorer Package

The DataExplorer package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.[1]

The single most important function from the DataExplorer package is create_report()

Try it for yourself.

pacman::p_load(DataExplorer)

create_report(df)

  1. DataExplorer Package